This notebook demonstrates how to use ML Workbench to create a regression model that accepts numeric and categorical data. It shows "cloud run" mode, in which each step runs on Google Cloud Platform services. Cloud runs are distributed, so they can handle large datasets without being limited by the memory, compute, or disk of a single machine. This notebook is similar to the previous one (Taxi Fare Model (small data)), but it uses the full dataset (about 77M instances).
There are only a few things that need to change between "local run" and "cloud run": each step takes the --cloud flag (plus a cloud_config where needed, such as region and scale_tier for training), and all input and output data must live in BigQuery or Google Cloud Storage rather than on local disk.
Other than this, nothing else changes from local to cloud!
Note: "Run all cells" does not work for this notebook because the steps are asynchronous. Many steps submit a cloud job, and you should track its status by following the job link.
Execution of this notebook requires Google Datalab (see setup instructions).
We will use the Chicago Taxi Trips data. Using the pickup location, drop-off location, trip start time, and taxi company, the model we build will predict the trip fare.
In [27]:
%%bq query --name taxi_query_eval
SELECT
unique_key,
fare,
CAST(EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS STRING) as weekday,
CAST(EXTRACT(DAYOFYEAR FROM trip_start_timestamp) AS STRING) as day,
CAST(EXTRACT(HOUR FROM trip_start_timestamp) AS STRING) as hour,
pickup_latitude,
pickup_longitude,
dropoff_latitude,
dropoff_longitude,
company
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
fare > 2.0 AND fare < 200.0 AND
pickup_latitude IS NOT NULL AND
pickup_longitude IS NOT NULL AND
dropoff_latitude IS NOT NULL AND
dropoff_longitude IS NOT NULL AND
MOD(ABS(FARM_FINGERPRINT(unique_key)), 100) < 5
In [28]:
%%bq query --name taxi_query_train
SELECT
unique_key,
fare,
CAST(EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS STRING) as weekday,
CAST(EXTRACT(DAYOFYEAR FROM trip_start_timestamp) AS STRING) as day,
CAST(EXTRACT(HOUR FROM trip_start_timestamp) AS STRING) as hour,
pickup_latitude,
pickup_longitude,
dropoff_latitude,
dropoff_longitude,
company
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
fare > 2.0 AND fare < 200.0 AND
pickup_latitude IS NOT NULL AND
pickup_longitude IS NOT NULL AND
dropoff_latitude IS NOT NULL AND
dropoff_longitude IS NOT NULL AND
MOD(ABS(FARM_FINGERPRINT(unique_key)), 100) >= 5
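The last WHERE condition splits the data deterministically: FARM_FINGERPRINT hashes each unique_key into one of 100 buckets, rows in buckets 0-4 (about 5%) go to the eval query, and the remaining ~95% go to the train query. The cell below is a minimal sketch of the same idea in Python; it uses hashlib's MD5 as a stand-in for FARM_FINGERPRINT (an assumption for illustration only), so the bucket values will not match BigQuery's, but the split is equally stable and disjoint.
In [ ]:
# Illustrative sketch of a deterministic hash-based train/eval split.
# Assumption: MD5 stands in for BigQuery's FARM_FINGERPRINT, so bucket values differ
# from the SQL above, but the idea (stable, disjoint 5%/95% split) is the same.
import hashlib

def split_bucket(unique_key, num_buckets=100):
    """Map a key to a stable bucket in [0, num_buckets)."""
    digest = hashlib.md5(unique_key.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets

def is_eval_row(unique_key, eval_percent=5):
    return split_bucket(unique_key) < eval_percent

print(is_eval_row('some-taxi-trip-key'))  # the same key always lands in the same split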
Create "chicago_taxi.train" and "chicago_taxi.eval" BQ tables to store results.
In [29]:
%%bq datasets create --name chicago_taxi
In [30]:
%%bq execute
query: taxi_query_eval
table: chicago_taxi.eval
mode: overwrite
Out[30]:
In [31]:
%%bq execute
query: taxi_query_train
table: chicago_taxi.train
mode: overwrite
Out[31]:
Sanity check on the data.
In [32]:
%%bq query
SELECT count(*) FROM chicago_taxi.train
Out[32]:
In [10]:
%%bq query
SELECT count(*) FROM chicago_taxi.eval
Out[10]:
The ML Workbench magics are a set of Datalab commands that provide an easy, largely code-free experience for training, deploying, and predicting with ML models. There is one magic command for each step of the ML workflow: analyzing input data to build transforms, transforming data, training a model, evaluating a model, and deploying a model. This notebook takes the data in the BigQuery tables above and builds a regression model.
For details of each command, run with --help. For example, "%%ml train --help".
This notebook runs the analyze, transform, and training steps in the cloud using managed services. Notice that the "--cloud" flag is set for each step.
In [3]:
import google.datalab.contrib.mlworkbench.commands # this loads the %%ml commands
In [35]:
%%ml dataset create
name: taxi_data_full
format: bigquery
train: chicago_taxi.train
eval: chicago_taxi.eval
In [ ]:
!gsutil mb gs://datalab-chicago-taxi-demo # Create a Storage Bucket to store results.
In [ ]:
!gsutil rm -r -f gs://datalab-chicago-taxi-demo/analysis # Remove previous analysis results if any
In [38]:
%%ml analyze --cloud
output: gs://datalab-chicago-taxi-demo/analysis
data: taxi_data_full
features:
unique_key:
transform: key
fare:
transform: target
company:
transform: embedding
embedding_dim: 10
weekday:
transform: one_hot
day:
transform: one_hot
hour:
transform: one_hot
pickup_latitude:
transform: scale
pickup_longitude:
transform: scale
dropoff_latitude:
transform: scale
dropoff_longitude:
transform: scale
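When the analyze job finishes, its output directory contains summary artifacts such as the inferred schema, numeric statistics, and vocabularies for the categorical columns. The cell below is a minimal sketch for listing them from Python; it assumes TensorFlow's tf.gfile (TF 1.x, as bundled with Datalab) for GCS access, and the exact artifact file names are produced by the analyze service rather than spelled out here.
In [ ]:
# A minimal sketch: list the artifacts written by %%ml analyze.
# Assumption: TensorFlow 1.x provides tf.gfile with GCS support in this environment.
import tensorflow as tf

for name in tf.gfile.ListDirectory('gs://datalab-chicago-taxi-demo/analysis'):
    print(name)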
The transform step applies the preprocessing transformations to the input data and saves the results as TFRecord files containing tf.Example protocol buffers, a format TensorFlow can read efficiently. This allows training to start from preprocessed data. Without this step, training would have to re-apply the same preprocessing to every row of data each time it is read, and since TensorFlow reads each row multiple times during training, the same row would be preprocessed multiple times. Writing the preprocessed data to disk therefore speeds up training.
The transform step is required if your source data is in a BigQuery table.
We run the transform step on both the training and eval data.
In [ ]:
!gsutil -m rm -r -f gs://datalab-chicago-taxi-demo/transform # Remove previous transform results if any.
The transform takes about 6 hours in the cloud. The data is fairly big (about 33 GB), and processing it locally on a single VM would take much longer.
In [40]:
%%ml transform --cloud
output: gs://datalab-chicago-taxi-demo/transform
analysis: gs://datalab-chicago-taxi-demo/analysis
data: taxi_data_full
In [5]:
!gsutil list gs://datalab-chicago-taxi-demo/transform/eval-*
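Once the transform job finishes, the shards listed above can be inspected directly. The cell below is a minimal sketch, assuming TensorFlow 1.x APIs (as bundled with Datalab) and gzip-compressed TFRecord shards; drop the options argument if the files turn out to be uncompressed.
In [ ]:
# A minimal sketch: decode one transformed row from the TFRecord output.
# Assumptions: TensorFlow 1.x (tf.gfile, tf.python_io) and gzip-compressed shards.
import tensorflow as tf

shards = tf.gfile.Glob('gs://datalab-chicago-taxi-demo/transform/eval-*')
options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.GZIP)
for record in tf.python_io.tf_record_iterator(shards[0], options=options):
    print(tf.train.Example.FromString(record))  # one preprocessed row as a tf.Example proto
    break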
In [5]:
%%ml dataset create
name: taxi_data_transformed
format: transformed
train: gs://datalab-chicago-taxi-demo/transform/train-*
eval: gs://datalab-chicago-taxi-demo/transform/eval-*
In [ ]:
!gsutil -m rm -r -f gs://datalab-chicago-taxi-demo/train # Remove previous training results.
Training takes about 30 minutes with the "STANDARD_1" scale_tier. Note that we perform 1M training steps, which would take much longer if run locally on Datalab's single VM. Cloud ML Engine runs training in a distributed way across multiple VMs, so it finishes much faster.
In [6]:
%%ml train --cloud
output: gs://datalab-chicago-taxi-demo/train
analysis: gs://datalab-chicago-taxi-demo/analysis
data: taxi_data_transformed
model_args:
model: dnn_regression
hidden-layer-size1: 400
hidden-layer-size2: 200
train-batch-size: 1000
max-steps: 1000000
cloud_config:
region: us-east1
scale_tier: STANDARD_1
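While the training job runs (and after it completes), you can watch the loss curves from the training output directory. The cell below is a minimal sketch, assuming the google.datalab.ml TensorBoard helper available in this Datalab version.
In [ ]:
# A minimal sketch: point TensorBoard at the training output on GCS.
# Assumption: google.datalab.ml exposes a TensorBoard helper with a start() method.
from google.datalab.ml import TensorBoard

TensorBoard.start('gs://datalab-chicago-taxi-demo/train')  # prints a link to the TensorBoard UI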
In [ ]:
# Delete previous results
!gsutil -m rm -r gs://datalab-chicago-taxi-demo/batch_prediction
Currently, the batch prediction service does not accept BigQuery data as input, so we export the eval data to a CSV file.
In [9]:
%%bq extract
table: chicago_taxi.eval
format: csv
path: gs://datalab-chicago-taxi-demo/eval.csv
Run batch prediction. Note that we use the evaluation_model because it accepts input data that includes the target (truth) column.
In [8]:
%%ml batch_predict --cloud
model: gs://datalab-chicago-taxi-demo/train/evaluation_model
output: gs://datalab-chicago-taxi-demo/batch_prediction
format: csv
data:
csv: gs://datalab-chicago-taxi-demo/eval.csv
cloud_config:
region: us-east1
Once batch prediction is done, check the results files. The batch prediction service writes its output as JSON files.
In [14]:
!gsutil list -l -h gs://datalab-chicago-taxi-demo/batch_prediction
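Each output file is newline-delimited JSON, one prediction per line. The cell below is a minimal sketch for peeking at a few records from Python, assuming TensorFlow's tf.gfile (TF 1.x) for GCS access; the field names shown in the comment follow the schema used in the BigQuery load below.
In [ ]:
# A minimal sketch: print the first few prediction records.
# Assumption: TensorFlow 1.x tf.gfile for reading from GCS; files are newline-delimited JSON.
import json
import tensorflow as tf

results = tf.gfile.Glob('gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results*')
with tf.gfile.GFile(results[0]) as f:
    for i, line in enumerate(f):
        print(json.loads(line))  # e.g. {"unique_key": ..., "predicted": ..., "target": ...}
        if i >= 2:
            break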
We can load the results back to BigQuery.
In [10]:
%%bq load
format: json
mode: overwrite
table: chicago_taxi.eval_results
path: gs://datalab-chicago-taxi-demo/batch_prediction/prediction.results*
schema:
- name: unique_key
type: STRING
- name: predicted
type: FLOAT
- name: target
type: FLOAT
With the data in BigQuery, we can do some analysis with queries, for example computing RMSE.
In [11]:
%%ml evaluate regression
bigquery: chicago_taxi.eval_results
Out[11]:
From the output above, the results are better than the local run on sampled data: RMSE is reduced by about 2.5%, MAE by around 20%, and the average absolute error by around 30%.
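As a cross-check on the %%ml evaluate output, the same RMSE can be computed with a plain query from Python. The cell below is a minimal sketch, assuming the google.datalab.bigquery client API that ships with Datalab.
In [ ]:
# A minimal sketch: compute RMSE directly from the results table.
# Assumption: the google.datalab.bigquery API (Query().execute().result().to_dataframe()).
import google.datalab.bigquery as bq

sql = '''
SELECT SQRT(AVG(POW(predicted - target, 2))) AS rmse
FROM `chicago_taxi.eval_results`
'''
print(bq.Query(sql).execute().result().to_dataframe())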
Select top results sorted by error.
In [12]:
%%bq query
SELECT
predicted,
target,
ABS(predicted-target) as error,
s.*
FROM `chicago_taxi.eval_results` as r
JOIN `chicago_taxi.eval` as s
ON r.unique_key = s.unique_key
ORDER BY error DESC
LIMIT 10
Out[12]:
There is also a feature slice visualization component designed for viewing evaluation results. It shows how prediction performance correlates with feature values.
In [40]:
%%bq query --name error_by_hour
SELECT
COUNT(*) as count,
hour as feature,
AVG(ABS(predicted - target)) as avg_error,
STDDEV(ABS(predicted - target)) as stddev_error
FROM `chicago_taxi.eval_results` as r
JOIN `chicago_taxi.eval` as s
ON r.unique_key = s.unique_key
GROUP BY hour
In [44]:
# Note: the interactive output is replaced with a static image so it displays well in github.
# Please execute this cell to see the interactive component.
from google.datalab.ml import FeatureSliceView
FeatureSliceView().plot(error_by_hour)
Out[44]:
In [42]:
%%bq query --name error_by_weekday
SELECT
COUNT(*) as count,
weekday as feature,
AVG(ABS(predicted - target)) as avg_error,
STDDEV(ABS(predicted - target)) as stddev_error
FROM `chicago_taxi.eval_results` as r
JOIN `chicago_taxi.eval` as s
ON r.unique_key = s.unique_key
GROUP BY weekday
In [45]:
# Note: the interactive output is replaced with a static image so it displays well in github.
# Please execute this cell to see the interactive component.
from google.datalab.ml import FeatureSliceView
FeatureSliceView().plot(error_by_weekday)
Out[45]:
From the charts above, we can see that the model performs worst at hours 5 and 6 (why?), and best on Sundays (less traffic?).
In [ ]:
!gsutil -m rm -r -f gs://datalab-chicago-taxi-demo # Clean up: delete the bucket and all results created by this notebook.